Fix parsing of non-finite values #3942

mjmckp · 2021-02-12T03:17:55Z

Proposed solution for #3941. Current implementation silently returns zero when input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead just explicitly check for these values and fail if there is no match. No attempt to optimise string allocations in this implementation since it is usually rarely invoked.

…datasets. Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.

…data set given as of array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array)

…urns zero when input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead just explicitly check for these values and fail if there is no match. No attempt to optimise string allocations in this implementation since it is usually rarely invoked.

StrikerRUS · 2021-02-12T11:59:57Z

@AlbertoEAF Could you please help to review?

AlbertoEAF

Great catch!

C++98 does not define aspect of inf/nan values in text, but C99 does, and so, unlike stod, apparently stringstreams might not work in all platforms (specially MSVC) to parse inf, and nan values: https://www.boost.org/doc/libs/1_70_0/libs/math/doc/html/math_toolkit/fp_facets/facets_intro.html
The old implementation used stod (which is not locale-safe for our purposes): https://en.cppreference.com/w/cpp/string/basic_string/stof. See the old version of of this very function here: 792c930#diff-ddbdb79f73cc159561b86b759b0542fec75ca9f4869a7eac5790e8f9ebaf7ef4R382
We shouldn't stop parsing numbers when fast_double_parser::parse_number fails. There might be some numerical values that besides inf, or nan which are not parsed either. Hence we should keep the stringstream parsing for numerical values on the slow path instead of giving a fatal error.

AlbertoEAF · 2021-02-13T17:19:49Z

include/LightGBM/utils/common.h

-      std::stringstream ss;
-      Common::C_stringstream(ss);
-      ss << str;
-      ss >> tmp;


Keep this code on the else branch instead of raising a fatal error.

But that code silently fails, which is completely unacceptable.

In reality we are only parsing strings generated by LightGBM itself when the model was written out to file, so we should ensure there are robust round-trip tests which include models that contain inf and nan values. There are no other possible non-finite values defined for IEEE 754 floating point numbers, and if at some point in the future the standard changed, this would be quickly picked up by the round-trip tests and easily addressed.

I am shocked that such tests don't already exist.

I meant doing number parsing after the nan and inf checks. Your logic makes sense regarding IEEE-754, although I wouldn't recommend dropping that parsing at the end without adding said tests first, not 100.0% sure fast_double_parser parses all numbers.

AlbertoEAF · 2021-02-13T18:30:35Z

include/LightGBM/utils/common.h

+        tmp = std::numeric_limits<double>::infinity();
+      else if (strlower == std::string("-inf"))
+        tmp = -std::numeric_limits<double>::infinity();
+      else if (strlower == std::string("nan"))


Important: Missing -nan handling.

Probably best to halve the string comparisons by parsing first the "-" sign.

Ok thanks, I'll add -nan. I don't think it's worth obfuscating this rarely executed branch with optimisations until profiling shows it is a bottleneck.

AlbertoEAF · 2021-02-13T18:37:56Z

include/LightGBM/utils/common.h

-      ss << str;
-      ss >> tmp;
+      std::string strlower(str); 
+      std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); });


Great clean code @mjmckp ;)

Instead of allocating a string, and since you already have a lambda, what about defining a case-insensitive comparison lambda and use std::equal to check the "inf" and "nan" values below?

Although this is the rare branch there might be longer strings than inf or nan which might be parsed here and might slow down our parsing without need.

Example: https://thispointer.com/c-case-insensitive-string-comparison-using-stl-c11-boost-library/

Again, this hardly seems worth it, this branch is rarely invoked, meanwhile a colossal amount of strings are being allocated in splitting and parsing the input file, so these few extra allocations are a drop in the ocean.

I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...

I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...

Agreed. Will you add such tests?
I actually run such tests but on an external lgbm provider and didn't have nan nor inf on my model, but would prefer to see them in lgbm's CI so any breakage is detected immediately.

I could assist in adding the tests, any idea where these should go and how best to implement them?

@mjmckp #3555 has been just merged. We will really appreciate new GTest-compatible tests in this or in a follow-up PR.

Sure, would you mind pointing me towards a similar kind of test that I can use as a starting point please?

Unfortunately, we don't have any tests yet. This is something that we should concentrate on in the near future. For now, I think you can take a look at tests from @AlbertoEAF in #3997.
https://github.com/microsoft/LightGBM/pull/3997/files#diff-c363eba6eda99d9e560f8341a1fc8fe02e885d2256db2482e1c543430a25666d

@mjmckp I've merged this PR with the aim to not delay the upcoming release. Please feel free to add tests in a new PR. We'll be very grateful! And thanks a lot for the bug fix!

@StrikerRUS Thanks a lot, I'll get up to speed on how the new tests work and add some tests for this in a new PR soon.

…into fix-doubleparsing

include/LightGBM/utils/common.h

Remove trailing whitespace to pass linting tests Co-authored-by: Nikita Titov <[email protected]>

StrikerRUS · 2021-03-12T22:10:16Z

@shiyu1994 @btrotta @guolinke Could you please review this to include in the upcoming release?

guolinke

LGTM

* [docs]Add alt text on images * Update docs/GPU-Windows.rst Co-authored-by: James Lamb <[email protected]> * Update docs/GPU-Windows.rst Co-authored-by: James Lamb <[email protected]> * Apply suggestions from code review Co-authored-by: James Lamb <[email protected]> * Apply suggestions from code review Co-authored-by: James Lamb <[email protected]> * Merge main branch commit updates (#1) * [docs] Add alt text to image in Parameters-Tuning.rst (#4035) * [docs] Add alt text to image in Parameters-Tuning.rst Add alt text to Leaf-wise growth image, as part of #4028 * Update docs/Parameters-Tuning.rst Co-authored-by: James Lamb <[email protected]> Co-authored-by: James Lamb <[email protected]> * [ci] [R-package] upgrade to R 4.0.4 in CI (#4042) * [docs] update description of deterministic parameter (#4027) * update description of deterministic parameter to require using with force_row_wise or force_col_wise * Update include/LightGBM/config.h Co-authored-by: Nikita Titov <[email protected]> * update docs Co-authored-by: Nikita Titov <[email protected]> * [dask] Include support for init_score (#3950) * include support for init_score * use dataframe from init_score and test difference with and without init_score in local model * revert refactoring * initial docs. test between distributed models with and without init_score * remove ranker from tests * test value for root node and change docs * comma * re-include parametrize * fix incorrect merge * use single init_score and the booster_ attribute * use np.float64 instead of float * [ci] ignore untitle Jupyter notebooks in .gitignore (#4047) * [ci] prevent getting incompatible dask and distributed versions (#4054) * [ci] prevent getting incompatible dask and distributed versions * Update .ci/test.sh Co-authored-by: Nikita Titov <[email protected]> * empty commit Co-authored-by: Nikita Titov <[email protected]> * [ci] fix R CMD CHECK note about example timings (fixes #4049) (#4055) * [ci] fix R CMD CHECK note about example timings (fixes #4049) * Apply suggestions from code review Co-authored-by: Nikita Titov <[email protected]> * empty commit Co-authored-by: Nikita Titov <[email protected]> * [ci] add CMake + R 3.6 test back (fixes #3469) (#4053) * [ci] add CMake + R 3.6 test back (fixes #3469) * Apply suggestions from code review Co-authored-by: Nikita Titov <[email protected]> * Update .ci/test_r_package_windows.ps1 * -Wait and remove rtools40 * empty commit Co-authored-by: Nikita Titov <[email protected]> * [dask] include multiclass-classification task in tests (#4048) * include multiclass-classification task and task_to_model_factory dicts * define centers coordinates. flatten init_scores within each partition for multiclass-classification * include issue comment and fix linting error * Update index.rst (#4029) Add alt text to logo image Co-authored-by: James Lamb <[email protected]> * [dask] raise more informative error for duplicates in 'machines' (fixes #4057) (#4059) * [dask] raise more informative error for duplicates in 'machines' * uncomment * avoid test failure * Revert "avoid test failure" This reverts commit 9442bdf. * [dask] add tutorial documentation (fixes #3814, fixes #3838) (#4030) * [dask] add tutorial documentation (fixes #3814, fixes #3838) * add notes on saving the model * quick start examples * add examples * fix timeouts in examples * remove notebook * fill out prediction section * table of contents * add line back * linting * isort * Apply suggestions from code review Co-authored-by: Nikita Titov <[email protected]> * Apply suggestions from code review Co-authored-by: Nikita Titov <[email protected]> * move examples under python-guide * remove unused pickle import Co-authored-by: Nikita Titov <[email protected]> * set 'pending' commit status for R Solaris optional workflow (#4061) * [docs] add Yu Shi to repo maintainers (#4060) * Update FAQ.rst * Update CODEOWNERS * set is_linear_ to false when it is absent from the model file (fix #3778) (#4056) * Add CMake option to enable sanitizers and build gtest (#3555) * Add CMake option to enable sanitizer * Set up gtest * Address reviewer's feedback * Address reviewer's feedback * Update CMakeLists.txt Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: Nikita Titov <[email protected]> * added type hint (#4070) * [ci] run Dask examples on CI (#4064) * Update Parallel-Learning-Guide.rst * Update test.sh * fix path * address review comments * [python-package] add type hints on Booster.set_network() (#4068) * [python-package] add type hints on Booster.set_network() * change behavior * [python-package] Some mypy fixes (#3916) * Some mypy fixes * address James' comments * Re-introduce pass in empty classes * Update compat.py Remove extra lines * [dask] [ci] fix flaky network-setup test (#4071) * [tests][dask] simplify code in Dask tests (#4075) * simplify Dask tests code * enable CI * disable CI * Revert "[ci] prevent getting incompatible dask and distributed versions (#4054)" (#4076) This reverts commit 4e9c976. * Fix parsing of non-finite values (#3942) * Fix index out-of-range exception generated by BaggingHelper on small datasets. Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero. * Update goss.hpp * Update goss.hpp * Add API method LGBM_BoosterPredictForMats which runs prediction on a data set given as of array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array) * Fix incorrect upstream merge * Add link to LightGBM.NET * Fix indenting to 2 spaces * Dummy edit to trigger CI * Dummy edit to trigger CI * remove duplicate functions from merge * Fix parsing of non-finite values. Current implementation silently returns zero when input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead just explicitly check for these values and fail if there is no match. No attempt to optimise string allocations in this implementation since it is usually rarely invoked. * Dummy commit to trigger CI * Also handle -nan in double parsing method * Update include/LightGBM/utils/common.h Remove trailing whitespace to pass linting tests Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: matthew-peacock <[email protected]> Co-authored-by: Guolin Ke <[email protected]> Co-authored-by: Nikita Titov <[email protected]> * [dask] remove unused imports from typing (#4079) * Range check for DCG position discount lookup (#4069) * Add check to prevent out of index lookup in the position discount table. Add debug logging to report number of queries found in the data. * Change debug logging location so that we can print the data file name as well. * Revert "Change debug logging location so that we can print the data file name as well." This reverts commit 3981b34. * Add data file name to debug logging. * Move log line to a place where it is output even when query IDs are read from a separate file. * Also add the out-of-range check to rank metrics. * Perform check after number of queries is initialized. * Update * [ci] upgrade R CI scripts to work on Ubuntu 20.04 (#4084) * [ci] install additional LaTeX packages in R CI jobs * update autoconf version * bump upper limit on package size to 100 * [SWIG] Add streaming data support + cpp tests (#3997) * [feature] Add ChunkedArray to SWIG * Add ChunkedArray * Add ChunkedArray_API_extensions.i * Add SWIG class wrappers * Address some review comments * Fix linting issues * Move test to tests/test_ChunkedArray_manually.cpp * Add test note * Move ChunkedArray to include/LightGBM/utils/ * Declare more explicit types of ChunkedArray in the SWIG API. * Port ChunkedArray tests to googletest * Please C++ linter * Address StrikerRUS' review comments * Update SWIG doc & disable ChunkedArray<int64_t> * Use CHECK_EQ instead of assert * Change include order (linting) * Rename ChunkedArray -> chunked_array files * Change header guards * Address last comments from StrikerRUS * store all CMake files in one place (#4087) * v3.2.0 release (#3872) * Update VERSION.txt * update appveyor.yml and configure * fix Appveyor builds Co-authored-by: James Lamb <[email protected]> Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: StrikerRUS <[email protected]> * [ci] Bump version for development (#4094) * Update .appveyor.yml * Update cran-comments.md * Update VERSION.txt * update configure Co-authored-by: James Lamb <[email protected]> * [ci] fix flaky Azure Pipelines jobs (#4095) * Update test.sh * Update setup.sh * Update .vsts-ci.yml * Update test.sh * Update setup.sh * Update .vsts-ci.yml * Update setup.sh * Update setup.sh Co-authored-by: Subham Agrawal <[email protected]> Co-authored-by: James Lamb <[email protected]> Co-authored-by: shiyu1994 <[email protected]> Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: jmoralez <[email protected]> Co-authored-by: marcelonieva7 <[email protected]> Co-authored-by: Philip Hyunsu Cho <[email protected]> Co-authored-by: Deddy Jobson <[email protected]> Co-authored-by: Alberto Ferreira <[email protected]> Co-authored-by: mjmckp <[email protected]> Co-authored-by: matthew-peacock <[email protected]> Co-authored-by: Guolin Ke <[email protected]> Co-authored-by: ashok-ponnuswami-msft <[email protected]> Co-authored-by: StrikerRUS <[email protected]> * Apply suggestions from code review Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: James Lamb <[email protected]> Co-authored-by: Subham Agrawal <[email protected]> Co-authored-by: shiyu1994 <[email protected]> Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: jmoralez <[email protected]> Co-authored-by: marcelonieva7 <[email protected]> Co-authored-by: Philip Hyunsu Cho <[email protected]> Co-authored-by: Deddy Jobson <[email protected]> Co-authored-by: Alberto Ferreira <[email protected]> Co-authored-by: mjmckp <[email protected]> Co-authored-by: matthew-peacock <[email protected]> Co-authored-by: Guolin Ke <[email protected]> Co-authored-by: ashok-ponnuswami-msft <[email protected]> Co-authored-by: StrikerRUS <[email protected]>

github-actions · 2023-08-23T23:58:17Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

matthew-peacock and others added 16 commits November 2, 2018 13:07

Fix index out-of-range exception generated by BaggingHelper on small …

c87b8e2

…datasets. Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.

Update goss.hpp

48b6e2c

Update goss.hpp

182e924

Merge remote-tracking branch 'upstream/master'

7d81cdb

Add API method LGBM_BoosterPredictForMats which runs prediction on a …

c3ab42f

…data set given as of array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array)

Merge branch 'master' of https://github.com/mjmckp/LightGBM

fa06bf6

Fix incorrect upstream merge

fbf8637

Add link to LightGBM.NET

84c3af8

Fix indenting to 2 spaces

af9c942

Dummy edit to trigger CI

c75f756

Merge remote-tracking branch 'upstream/master'

f81ed85

Dummy edit to trigger CI

25a7a29

Merge remote-tracking branch 'upstream/master'

5562014

Merge remote-tracking branch 'upstream/master'

e29ae45

remove duplicate functions from merge

6272a30

mjmckp requested review from btrotta, chivee and guolinke as code owners February 12, 2021 03:17

Dummy commit to trigger CI

108c6de

StrikerRUS added the fix label Feb 12, 2021

AlbertoEAF suggested changes Feb 13, 2021

View reviewed changes

AlbertoEAF reviewed Feb 13, 2021

View reviewed changes

matthew-peacock added 2 commits February 14, 2021 13:53

Also handle -nan in double parsing method

be702c0

Merge branch 'fix-doubleparsing' of https://github.com/mjmckp/LightGBM …

1aea9d9

…into fix-doubleparsing

StrikerRUS reviewed Feb 18, 2021

View reviewed changes

include/LightGBM/utils/common.h Outdated Show resolved Hide resolved

Update include/LightGBM/utils/common.h

4f06459

Remove trailing whitespace to pass linting tests Co-authored-by: Nikita Titov <[email protected]>

AlbertoEAF mentioned this pull request Mar 6, 2021

v3.2.0 release #3872

Merged

jameslamb added the awaiting review label Mar 12, 2021

guolinke approved these changes Mar 16, 2021

View reviewed changes

StrikerRUS removed the awaiting review label Mar 16, 2021

StrikerRUS merged commit 4e47c93 into microsoft:master Mar 16, 2021

shiyu1994 mentioned this pull request Mar 24, 2021

Precise text file parsing #4081

Merged

StrikerRUS mentioned this pull request Mar 26, 2021

inf/nan values in model are silently converted to zero when model is round-tripped to file #3941

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix parsing of non-finite values #3942

Fix parsing of non-finite values #3942

mjmckp commented Feb 12, 2021

StrikerRUS commented Feb 12, 2021

AlbertoEAF left a comment

AlbertoEAF Feb 13, 2021

mjmckp Feb 14, 2021

AlbertoEAF Feb 14, 2021

AlbertoEAF Feb 13, 2021

mjmckp Feb 14, 2021

AlbertoEAF Feb 13, 2021 •

edited

Loading

AlbertoEAF Feb 13, 2021

mjmckp Feb 14, 2021

AlbertoEAF Feb 14, 2021

mjmckp Feb 18, 2021

StrikerRUS Mar 12, 2021

mjmckp Mar 12, 2021

StrikerRUS Mar 13, 2021

StrikerRUS Mar 16, 2021

mjmckp Mar 16, 2021

StrikerRUS commented Mar 12, 2021

guolinke left a comment

github-actions bot commented Aug 23, 2023

Fix parsing of non-finite values #3942

Fix parsing of non-finite values #3942

Conversation

mjmckp commented Feb 12, 2021

StrikerRUS commented Feb 12, 2021

AlbertoEAF left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlbertoEAF Feb 13, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StrikerRUS commented Mar 12, 2021

guolinke left a comment

Choose a reason for hiding this comment

github-actions bot commented Aug 23, 2023

AlbertoEAF Feb 13, 2021 •

edited

Loading